Abstract

This is the supplementary material for Designing Category-Level Attributes for
Discriminative Visual Recognition [3]. We first provide an overview of the proposed
approach in Section 1. The proof of the theorem is given in Section 2. Additional
remarks on the proposed attribute design algorithm are provided in Section 3. We
show additional experiments and applications of the designed attributes for zero-shot
learning and video event modeling in Section 4. Finally, we discuss the semantic
aspects of automatic attribute design in Section 5. All figures in this technical report
are best viewed in color.


1  Overview  of  the  Proposed  Approach 

Figure 1 provides an overview of the proposed approach.

In the offline phase, given a set of images labeled with pre-defined categories (a
multiclass dataset), our approach automatically learns a category-attribute matrix that
defines the category-level attributes. A set of attribute classifiers is then learned based
on the defined attributes (not shown in the figure). Unlike previous work, in
which both the attributes and the category-attribute matrix are pre-defined (as in the
“manually defined attributes”), the proposed process is fully automatic.

In the online phase, given an image from a novel category, we can compute the
designed category-level attributes. The computed values of three attributes (colored
orange, yellow and green) are shown in Figure 1. For example, the first image
(of a raccoon) has positive responses for the orange and green attributes and a negative
response for the yellow attribute. Because the category-level attributes are defined based
on a category-attribute matrix, they can be interpreted as relative associations with
the pre-defined categories. For example, the orange attribute has positive associations
with mole and Siamese cat, and negative associations with killer whale and blue whale.

[Figure 1 panel title: Describing the designed category-level attributes]

Figure 1: Overview of the proposed approach. (1) Designing the category-attribute matrix.
(2) Computing the attributes for images of novel categories.

The category-level attributes are more intuitive than mid-level representations
defined on low-level features. In fact, our attributes can be seen as soft groupings of
categories, analogous to building a taxonomy or concept hierarchy in
library science. We will further discuss the semantic aspects of the proposed method
in Section 5.


2  Supplementary of the Learning Framework


2.1  Proof  of  Theorem  1 

Theorem 1. The empirical error of multi-class classification is upper bounded by $2\epsilon/\rho$.

Proof. Given an example $(x, y)$, assume the example is misclassified as some category
$z \neq y$, meaning that

$$\| A_{y\cdot} - f(x) \| > \| A_{z\cdot} - f(x) \|. \tag{1}$$

Then

$$\| A_{y\cdot} - f(x) \| > \frac{\| A_{y\cdot} - f(x) \| + \| A_{z\cdot} - f(x) \|}{2}. \tag{2}$$

From the triangle inequality and the definition of $\rho$:

$$\| A_{y\cdot} - f(x) \| + \| A_{z\cdot} - f(x) \| \geq \| A_{y\cdot} - A_{z\cdot} \| \geq \rho. \tag{3}$$

So misclassifying $(x, y)$ implies that

$$\| A_{y\cdot} - f(x) \| > \frac{\rho}{2}. \tag{4}$$

Therefore, given $m$ examples $(x_1, y_1), \ldots, (x_m, y_m)$, the number of category recognition
mistakes we make is at most

$$\frac{\sum_{i=1}^{m} \| A_{y_i\cdot} - f(x_i) \|}{\rho/2} = \frac{2m\epsilon}{\rho}. \tag{5}$$

Thus the empirical error is upper bounded by $2\epsilon/\rho$. □
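
The bound can be checked numerically on synthetic data. The sketch below is illustrative only: it assumes nearest-row decoding, takes $\rho$ to be the minimum distance between two distinct rows of $A$ and $\epsilon$ to be the average encoding error $\frac{1}{m}\sum_i \| A_{y_i\cdot} - f(x_i) \|$, and replaces the attribute encoder $f$ with a noisy copy of the correct row. The function names, sizes and noise level are hypothetical.

```python
import numpy as np

def decode(A, F):
    """Nearest-row decoding: assign each encoding in F to the closest row of A."""
    # dists[i, k] = || A[k] - F[i] ||
    dists = np.linalg.norm(F[:, None, :] - A[None, :, :], axis=2)
    return dists.argmin(axis=1)

def min_row_separation(A):
    """rho: smallest Euclidean distance between two distinct rows of A."""
    d = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # ignore the zero distance of a row to itself
    return d.min()

rng = np.random.default_rng(0)
k, l, m = 5, 8, 200                            # categories, attributes, examples (toy sizes)
A = rng.standard_normal((k, l))                # toy category-attribute matrix
y = rng.integers(0, k, size=m)                 # ground-truth labels
F = A[y] + 0.8 * rng.standard_normal((m, l))   # noisy stand-in for f(x), an assumption

rho = min_row_separation(A)
eps = np.linalg.norm(A[y] - F, axis=1).mean()  # average encoding error
err = (decode(A, F) != y).mean()               # empirical multi-class error
bound = 2 * eps / rho

print(f"empirical error = {err:.3f}, bound 2*eps/rho = {bound:.3f}")
```

The bound holds for any encoder and any matrix with distinct rows, but it can be loose: it only becomes informative when the encoding error is small relative to the row separation.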


3  Supplementary  of  the  Algorithm 

3.1  Parameters  of  the  Algorithm 

There are two parameters in the attribute design algorithm, $\lambda$ and $\eta$. A larger $\lambda$
means a smaller $\rho$ for the category-attribute matrix, and a larger $\eta$ means less redundancy
$r$ for the designed attributes. Figure 2 visualizes the influence of the parameters based
on a randomly generated visual proximity matrix $S$.
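
To make the two quantities concrete, the following sketch measures them for a given category-attribute matrix. It is a hedged illustration, not the paper's code: the row separation is taken as the minimum distance between distinct rows, and the redundancy formula used here (Frobenius norm of $A^T A - I$ divided by the number of attributes) is an assumption, so the definition in the main paper takes precedence. The function names are hypothetical.

```python
import numpy as np

def row_separation(A):
    """Minimum Euclidean distance between two distinct rows (categories) of A."""
    d = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return d.min()

def redundancy(A):
    """Assumed redundancy measure: ||A^T A - I||_F / l for l attribute columns."""
    l = A.shape[1]
    return np.linalg.norm(A.T @ A - np.eye(l), "fro") / l

rng = np.random.default_rng(0)
# 50 categories, 4 attributes with orthonormal columns (no redundancy).
Q, _ = np.linalg.qr(rng.standard_normal((50, 4)))
# Same matrix, but the 4th attribute duplicates the 1st (redundant).
A_dup = np.column_stack([Q[:, :3], Q[:, 0]])

print("rho:", row_separation(Q), row_separation(A_dup))
print("r:  ", redundancy(Q), redundancy(A_dup))
```

Under this assumed measure, orthonormal attribute columns give a redundancy of zero and a duplicated attribute raises it.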

4  Supplementary  of  the  Experiments 

4.1  Zero-shot  Learning 

Figure 3 visualizes the averaged similarity matrix based on the results of 10 users.
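
The averaging procedure described in the caption of Figure 3 can be sketched as follows. This is an illustration under stated assumptions: the matrix is taken to have one row per novel category and one column per known category, and random picks stand in for the real user selections, which are not reproduced here. All names are hypothetical.

```python
import numpy as np

n_users, n_novel, n_known, top = 10, 10, 40, 5
rng = np.random.default_rng(0)

def user_matrix(selected):
    """Binary matrix for one user: selected[i] lists the known categories
    picked as most similar to novel category i."""
    M = np.zeros((n_novel, n_known))
    for i, picks in enumerate(selected):
        M[i, picks] = 1.0   # selected elements are set to 1, the others stay 0
    return M

# Hypothetical selections: random picks stand in for real user judgments.
selections = [
    [rng.choice(n_known, size=top, replace=False) for _ in range(n_novel)]
    for _ in range(n_users)
]

# Average the per-user binary matrices into one similarity matrix.
S_avg = np.mean([user_matrix(s) for s in selections], axis=0)
print(S_avg.shape, S_avg.sum(axis=1))
```

Each entry of the averaged matrix is the fraction of users who selected that known category for that novel category, so every row sums to 5.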

Figure 2: The influence of the two parameters. Left: the influence of $\lambda$: a larger $\lambda$ means a smaller $\rho$
for the category-attribute matrix. Right: the influence of $\eta$: a larger $\eta$ means less redundancy $r$ for
the designed attributes. The visual proximity matrix $S$ used for this figure is a 50×50 randomly
generated non-negative symmetric matrix.

Figure 3: The manually built visual similarity matrix. It characterizes the visual similarity between
the 10 novel categories and the 40 known categories. This matrix is obtained by averaging the
similarity matrices built by 10 different users. Each user is asked to build a visual similarity
matrix by selecting the 5 most visually similar known categories for each novel category. The selected
elements are set to 1, and the others to 0.

Figure 4: Average Precision results based on the low-level feature and the attributes for the full exemplar task
of TRECVID MED 2012. The results are evaluated on the internal threshold split containing 20%
of the training data. Linear SVMs are used for event modeling. The same low-level features are
used for training attributes.


4.2  Designing  Attributes  for  Video  Event  Modeling 

We  show  one  additional  application  of  using  attributes  for  video  event  classification 
on  the  TRECVID  2012  MED  task. 

Traditionally, the semantic features for video event modeling are learned from a
taxonomy with labeled images [1]. The taxonomy is manually defined based on
expert knowledge, and a set of images must be labeled by human experts. Like
manually specified attributes, the semantic features suffer from the following
problems.

•  The human defining and labeling processes are very expensive, especially if we
need a large number of concepts with enough clean training data.

•  Though the taxonomy is semantically plausible, it may not be consistent with the
visual feature distributions. Consequently, some dimensions of the semantic feature
vector are difficult to model.

Motivated by the above facts, we use the proposed category-level attributes as a
data-consistent way of modeling “semantics”. Specifically, we design attributes based
on 518 leaf nodes of the taxonomy [1] (as the known categories).

To test the performance of the proposed approach, we have trained and extracted
a 2,500-dimensional attribute feature for the pre-specified task of MED. Figure 4 shows
the performance of the low-level feature and the proposed attribute feature.
Impressively, the attributes achieve a relative performance gain of over 60%, improving the
mAP from 0.075 to 0.123.
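
For reference, the relative gain follows directly from the two reported mAP values:

```latex
\[
\frac{0.123 - 0.075}{0.075} = \frac{0.048}{0.075} = 0.64 ,
\]
```

that is, a relative improvement of 64%, consistent with the "over 60%" stated above.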

Figure 5: Using category-level attributes to describe images of novel categories. In the table
below, three attributes are described in terms of their top positive/negative known
categories in the category-attribute matrix. Some designed attributes can be further interpreted
by concise names: the first two can be described as small land animals vs. ocean animals and black
vs. non/partial-black. Some may not be interpreted concisely: the third one is like rodent vs.
tiger and cloven-hoofed animals. The figure shows the computed attribute values for images of
novel categories.

Attribute   Top positive known categories          Top negative known categories
1           mole, Siamese cat, persian cat         killer whale, blue whale, seal
2           gorilla, humpback whale, chimpanzee    giraffe, antelope, zebra
3           bat, mouse, hamster                    tiger, zebra, antelope


5  Discussions  about  Semantics 

5.1  Interpretations  of  the  Category-Level  Attributes 

One unique advantage of the designed attributes is that they can provide
interpretable cues for visualizing the machine reasoning process. In other words, the
designed attributes can be used to answer not only “what” but also “why” an image
is recognized as a certain category. First, because the attributes are designed at the category
level, descriptions are readily available through weighted category names (e.g., the
attribute that has a high association with polar bear and low associations with walrus and
lion). Second, the regularization term $J_2(A)$ in the attribute design formulation can
in fact lead to human-interpretable attributes, by inducing “similar” categories not
to be far away in attribute space.

Some examples of using the computed attributes to describe images of novel
categories are shown in Figure 5.
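
One way to produce such descriptions is to rank the known categories by their weights in a column of the category-attribute matrix. The sketch below is a minimal illustration: the function name, the category subset and the weights are made up for the example and are not values from the paper.

```python
import numpy as np

def describe_attribute(column, names, top=3):
    """Return the top positive and top negative known categories for one
    attribute, i.e. one column of the category-attribute matrix."""
    order = np.argsort(column)                      # ascending by weight
    positive = [names[i] for i in order[::-1][:top]]
    negative = [names[i] for i in order[:top]]
    return positive, negative

# Hypothetical weights for a handful of known categories.
names = ["mole", "Siamese cat", "persian cat",
         "killer whale", "blue whale", "seal", "zebra"]
a = np.array([0.9, 0.8, 0.7, -0.9, -0.8, -0.6, 0.1])

pos, neg = describe_attribute(a, names)
print(", ".join(pos), "vs.", ", ".join(neg))
```

With these toy weights the attribute reads as "mole, Siamese cat, persian cat vs. killer whale, blue whale, seal", which a person might summarize as small land animals vs. ocean animals.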

5.2  Designing  Semantic  Attributes 

Note that not all designed attributes can be semantically interpreted. We discuss
one possible way of enhancing the semantics in the attribute design process, with
the help of human interaction.

The solution is to modify the attribute design algorithm with an additional semantic
projection step: after obtaining each $a$ (a column of the category-attribute matrix), make
some changes to $a$ to get $\varphi(a)$, such that $\varphi(a)$ is semantically meaningful. Figure 6
shows an example of semantic projection (by a human). In this example, by changing $a$
to $\varphi(a)$, the designed attribute can be easily interpreted as “water dwelling animals”.

Figure 6: Semantic projection for designing attributes with stronger semantics.

Specifically, given an initial pool of pre-defined attributes, together with their
manually specified category-attribute matrix, we can define rules for what kinds of
category-level attributes are semantically meaningful. For instance, it is intuitive to
say that the union (black or white), intersection (black and white), and subset (chimpanzee
kinds of black; attributes are often category-dependent) of the manually defined
attributes are semantically interpretable. The operations of union, intersection, etc. can
be modeled by operations on the manually specified category-attribute matrix. The
designed attributes can then be projected to the nearest semantic candidate:

$$\varphi(a) = \arg\min_{a' \in \mathcal{A}} \| a' - a \|, \tag{6}$$

in which $\mathcal{A}$ is the semantic space defined by the rules. This method can be used to efficiently
expand the predefined semantic attributes. We will study this in our future work.
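
As a rough sketch of how Equation (6) could be realized over a finite candidate set, the code below builds candidates from a binary manually specified category-attribute matrix using pairwise union and intersection, then returns the candidate closest to a designed attribute. Several things are assumptions made for illustration: the semantic space is restricted to single attributes plus pairwise unions and intersections, the designed attribute is compared to binary candidates without any rescaling, and the category names, attribute names and values are hypothetical.

```python
import numpy as np
from itertools import combinations

def semantic_candidates(M):
    """Candidates from a binary matrix M (categories x manual attributes):
    each attribute, plus the union and intersection of every pair."""
    cands = [M[:, j] for j in range(M.shape[1])]
    for i, j in combinations(range(M.shape[1]), 2):
        cands.append(np.maximum(M[:, i], M[:, j]))   # union: i OR j
        cands.append(np.minimum(M[:, i], M[:, j]))   # intersection: i AND j
    return np.array(cands, dtype=float)

def project(a, cands):
    """Nearest candidate to a in Euclidean distance (Eq. 6 on a finite set)."""
    return cands[np.linalg.norm(cands - a, axis=1).argmin()]

# Rows: killer whale, blue whale, seal, mole, zebra (toy example).
# Columns: hypothetical manual attributes "swims" and "black".
M = np.array([[1, 1],
              [1, 0],
              [1, 0],
              [0, 1],
              [0, 0]])
a = np.array([0.9, 0.8, 0.7, 0.1, -0.1])   # made-up designed attribute

cands = semantic_candidates(M)
phi_a = project(a, cands)
print(phi_a)
```

In this toy case the designed attribute is projected onto the "swims" column, which matches the "water dwelling animals" reading discussed above. A real system would need richer rules (such as subsets) and a principled way to put designed and manual attributes on the same scale.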